A time series is a sequence of observations taken sequentially in time. Our aim here, as stated in the milestone report, is:
"The developed prediction model can be utilized by the electrical utilities to effectively plan their energy generation operations and balance the demand with appropriate supply. An efficient forecast can prove very useful for the utilities in planning their day to day operations, meeting their customers’ energy demand, and avoiding excess generation of energy."
Basically, we want to fit an ML model to our energy consumption time series and use it to predict future energy consumption. Our dataset has the dependent variable 'SDGE', which represents the energy consumption of the SDGE region in MWH, along with independent variables such as temperature in deg F, cumulative PV installation to date in kW, and holidays. Since it is a time series, I have also added time-dependent features such as hour of the day, day of the week, day of the month, season, etc. We can use all these features to train our model and predict future energy values.
Any time series is composed of the following components: a trend, seasonality, and an irregular/residual (noise) component (ref link).
Another important feature of most time series is that observations close together in time tend to be correlated (serially dependent). This property is called autocorrelation and is central to conventional time series modeling techniques like ARIMA.
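As a quick illustration, autocorrelation can be checked directly with pandas. The series below is a toy hourly signal with a daily cycle, not the SDGE data:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
# Toy hourly series with a 24-hour cycle (illustrative only, not the SDGE data)
rng = pd.date_range('2018-01-01', periods=24 * 14, freq='h')
ts = pd.Series(np.sin(2 * np.pi * rng.hour / 24) + np.random.normal(0, 0.1, len(rng)),
               index=rng)

# Series.autocorr gives the correlation of the series with a lagged copy of itself
print(ts.autocorr(lag=1))   # adjacent hours are strongly positively correlated
print(ts.autocorr(lag=12))  # values half a cycle apart are strongly anti-correlated
```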
Modeling a time series involves a slightly different approach than regular ML problems. Here is a good link explaining the main differences.
In this notebook I have used the basic time series models like ARIMA and FB-prophet and then have extended my approach to include linear regression, random forests and XGBoost to see whether or not these linear and non-linear approaches can model our time series accurately.
At the end I have listed all the models with their error metrics, based on which we can conclude which model(s) perform best. In any time series problem it is very important to define the window of prediction beforehand. Here I have tested the models on future windows of 1 hour, 1 week and 8 months. 1-hour-ahead and 1-week-ahead forecasting is easier and can use lagged values of the energy consumption to increase forecast accuracy: simple models like SARIMAX, Elastic Net and Random Forest regression give around 98% R2 score and 1-2% MAPE using lag variables for these short-term forecasts. But if we need to forecast over a long-term window (more than 1 month), we will see that FB Prophet and XGBoost (with Fourier terms for seasonalities) perform very well.
I have used the following forecasting metrics to measure the prediction performance of the models used: reference
R squared: coefficient of determination (this can be interpreted as the percentage of variance explained by the model), $(-\infty, 1]$
$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$
sklearn.metrics.r2_score
Mean Absolute Error: this has the same unit of measurement as the initial series, $[0, +\infty)$
$MAE = \frac{\sum\limits_{i=1}^{n} |y_i - \hat{y}_i|}{n}$
sklearn.metrics.mean_absolute_error
Root Mean Squared Error: the most commonly used metric; it penalizes large errors more heavily. This too has the same unit of measurement as the initial series, $[0, +\infty)$
$RMSE = \sqrt{\frac{1}{n}\sum\limits_{i=1}^{n} (y_i - \hat{y}_i)^2}$
np.sqrt(sklearn.metrics.mean_squared_error)
Mean Absolute Percentage Error: this is the same as MAE but is computed as a percentage, which is very convenient when you want to explain the quality of the model to management, $[0, +\infty)$
$MAPE = \frac{100}{n}\sum\limits_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{y_i}$
def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
Out of the 4 metrics, MAPE will be chosen to sort the models. MAPE helps us understand the % error on the absolute value that can be expected from a model. For example, if the MAPE is 1% and the observed energy value is 4000 MWH, we can say that our model on average predicts accurately within a $\pm 40$ MWH error band.
The error metrics for all the models will be compared to a baseline persistent forecast, wherein the values from the previous year are simply repeated (when forecasting over a longer window of >1 month).
Time series cross validation:
Cross-validation for time series is a bit different because time series have a temporal structure, and one cannot randomly shuffle values into folds without destroying that structure. With randomization, all time dependencies between observations would be lost. This is why we have to use a trickier approach to optimizing the model parameters, such as "cross-validation on a rolling basis".
The idea is rather simple -- we train our model on a small segment of the time series from the beginning until some time $t$, make predictions for the next $n$ steps (from $t$ to $t+n$), and calculate an error. Then, we expand our training sample to $t+n$, make predictions from $t+n$ until $t+2n$, and continue moving our test segment of the time series until we hit the last available observation. As a result, we have as many folds as segments of length $n$ fit between the initial training sample and the last observation. This can be implemented using sklearn.model_selection's TimeSeriesSplit.
Reference
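As a small sketch of this rolling scheme (on a toy 12-point series rather than our actual data), TimeSeriesSplit produces folds whose training window always precedes the test window:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy 12-observation "series"; in the notebook this would be the feature matrix
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # each fold trains strictly on the past and tests on the future
    print('train:', train_idx, 'test:', test_idx)
# train: [0 1 2]              test: [3 4 5]
# train: [0 1 2 3 4 5]        test: [6 7 8]
# train: [0 1 2 3 4 5 6 7 8]  test: [9 10 11]
```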
Importing the required libraries and modules
# Import basic modules
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
from matplotlib import rcParams
# Import regression and error metrics modules
import statsmodels.api as sm
from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
# Import plotly modules to view time series in a more interactive way
import plotly.graph_objects as go
# Standard scaler for preprocessing
from sklearn.preprocessing import StandardScaler
# Importing time series split for cross validation
from sklearn.model_selection import TimeSeriesSplit
plt.style.use('bmh')
# special IPython command to prepare the notebook for matplotlib and other libraries
%matplotlib inline
#sns.set_style("whitegrid")
#sns.set_context("poster")
# Import the 5 years of hourly energy consumption data previously cleaned, explored, feature-engineered and...
# ..stored as hourly1418_energy_temp_PV.csv
sdge = pd.read_csv('hourly1418_energy_temp_PV.csv', index_col = 'Dates', parse_dates=['Dates', 'Date'])
sdge.head()
Defining basic helper functions (error metrics, plotting predicted vs. observed series, etc.) so that the same functions can be reused seamlessly across different models, which also lets us compare all the models using the same metrics.
# Creating an empty dict to save all the errors from different models
dict_error = dict()
# creating function for plotting predicted vs actual energy values
def plot_predvstrue_reg(pred, truth, model_name=None):
"""
Plots the observed energy consumption against the predicted energy consumption
"""
fig, ax = plt.subplots(1,1, figsize=(8,8))
ax.scatter(truth, pred)
_ = plt.xlabel("Observed energy in MWH")
_ = plt.ylabel("Predicted energy in MWH")
_ = plt.title("Observed vs Predicted energy using model {}".format(model_name))
_ = plt.xlim(1000, 5000)
_ = plt.ylim(1000, 5000)
#plotting 45 deg line to see how the prediction differs from the observed values
x = np.linspace(*ax.get_xlim())
_ = ax.plot(x, x)
def error_metrics(y_pred, y_truth, model_name = None, test = True):
"""
Printing error metrics like RMSE (root mean square error), R2 score,
MAE (mean absolute error), MAPE (mean absolute % error).
y_pred: predicted values of y using the model model_name
y_truth: observed values of y
model_name: name of the model used for predictions
test: if validating on test set, True; otherwise False for training set validation
The function will print the RMSE, R2, MAE and MAPE error metrics for the model_name and also store the results along with
model_name in the dictionary dict_error so that we can compare all the models at the end.
"""
if not isinstance(y_pred, np.ndarray):
    y_pred = y_pred.to_numpy()
if not isinstance(y_truth, np.ndarray):
    y_truth = y_truth.to_numpy()
print('\nError metrics for model {}'.format(model_name))
RMSE = np.sqrt(mean_squared_error(y_truth, y_pred))
print("RMSE or Root mean squared error: %.2f" % RMSE)
# Explained variance score: 1 is perfect prediction
R2 = r2_score(y_truth, y_pred)
print('Variance score: %.2f' % R2 )
MAE = mean_absolute_error(y_truth, y_pred)
print('Mean Absolute Error: %.2f' % MAE)
MAPE = (np.mean(np.abs((y_truth - y_pred) / y_truth)) * 100)
print('Mean Absolute Percentage Error: %.2f %%' % MAPE)
# Appending the error values along with the model_name to the dict
if test:
train_test = 'test'
else:
train_test = 'train'
#df = pd.DataFrame({'model': model_name, 'RMSE':RMSE, 'R2':R2, 'MAE':MAE, 'MAPE':MAPE}, index=[0])
name_error = ['model', 'train_test', 'RMSE', 'R2', 'MAE', 'MAPE']
value_error = [model_name, train_test, RMSE, R2, MAE, MAPE]
list_error = list(zip(name_error, value_error))
for error in list_error:
if error[0] in dict_error:
# append the new number to the existing array at this slot
dict_error[error[0]].append(error[1])
else:
# create a new array in this slot
dict_error[error[0]] = [error[1]]
#return(dict_error)
def plot_timeseries(ts, title = 'og', opacity = 1):
"""
Plot plotly time series of any given timeseries ts
"""
fig = go.Figure()
fig.add_trace(go.Scatter(x = ts.index, y = ts.values, name = "observed",
line_color = 'lightslategrey', opacity = opacity))
fig.update_layout(title_text = title,
xaxis_rangeslider_visible = True)
fig.show()
def plot_ts_pred(og_ts, pred_ts, model_name=None, og_ts_opacity = 0.5, pred_ts_opacity = 0.5):
"""
Plot plotly time series of the original (og_ts) and predicted (pred_ts) time series values to check how our model performs.
model_name: name of the model used for predictions
og_ts_opacity: opacity of the original time series
pred_ts_opacity: opacity of the predicted time series
"""
fig = go.Figure()
fig.add_trace(go.Scatter(x = og_ts.index, y = np.array(og_ts.values), name = "Observed",
line_color = 'deepskyblue', opacity = og_ts_opacity))
try:
fig.add_trace(go.Scatter(x = pred_ts.index, y = pred_ts, name = model_name,
line_color = 'lightslategrey', opacity = pred_ts_opacity))
except: #if predicted values are a numpy array they won't have an index
fig.add_trace(go.Scatter(x = og_ts.index, y = pred_ts, name = model_name,
line_color = 'lightslategrey', opacity = pred_ts_opacity))
#fig.add_trace(go)
fig.update_layout(title_text = 'Observed test set vs predicted energy MWH values using {}'.format(model_name),
xaxis_rangeslider_visible = True)
fig.show()
def train_test(data, test_size = 0.15, scale = False, cols_to_transform=None, include_test_scale=False):
"""
Perform train-test split with respect to time series structure
- data: dataframe with variables X_n to train on and the dependent output y, which is the column 'SDGE' in this notebook
- test_size: size of test set
- scale: if True, then the columns in the -'cols_to_transform'- list will be scaled using StandardScaler
- include_test_scale: If True, the StandardScaler fits the data on the training as well as the test set; if False, then
the StandardScaler fits only on the training set.
"""
df = data.copy()
# get the index after which test set starts
test_index = int(len(df)*(1-test_size))
# StandardScaler fit on the entire dataset
if scale and include_test_scale:
scaler = StandardScaler()
df[cols_to_transform] = scaler.fit_transform(df[cols_to_transform])
X_train = df.drop('SDGE', axis = 1).iloc[:test_index]
y_train = df.SDGE.iloc[:test_index]
X_test = df.drop('SDGE', axis = 1).iloc[test_index:]
y_test = df.SDGE.iloc[test_index:]
# StandardScaler fit only on the training set
if scale and not include_test_scale:
scaler = StandardScaler()
X_train[cols_to_transform] = scaler.fit_transform(X_train[cols_to_transform])
X_test[cols_to_transform] = scaler.transform(X_test[cols_to_transform])
return X_train, X_test, y_train, y_test
Starting off with simple linear regression models to see how they perform on our time series dataset.
# creating categorical columns for linear regression
cat_cols = ['year', 'month', 'day', 'hour', 'weekday', 'season', 'holiday', 'non_working']
for col in cat_cols:
sdge[col] = sdge[col].astype('category')
sdge.info()
# Preparing dummy columns for use in sklearn's linear regression
sdge_lin = pd.get_dummies(sdge, drop_first = True)
sdge_lin.head(3)
sdge_lin.columns
The dataset contains multiple external regressors in addition to the time-related terms: 'DailyCoolingDegreeDays', 'DailyHeatingDegreeDays', 'HourlyDryBulbTemperature', 'AC_kW' and 'cum_AC_kW'. In this project we will focus only on the 'HourlyDryBulbTemperature' and 'cum_AC_kW' columns: 'AC_kW' is already captured cumulatively in 'cum_AC_kW', and 'DailyCoolingDegreeDays' and 'DailyHeatingDegreeDays' are derived from the daily temperature, so to avoid multicollinearity between the X variables we will use only the temperature data.
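This collinearity argument can be sanity-checked with a quick correlation computation. The frame below is a toy stand-in (column names are taken from the notebook, but the values are simulated), since cooling degree days are by definition a function of temperature (base 65 deg F):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
# Simulated daily temperatures; cooling degree days use a base of 65 deg F
temp = pd.Series(np.random.uniform(50, 95, 365))
df = pd.DataFrame({
    'HourlyDryBulbTemperature': temp,
    'DailyCoolingDegreeDays': (temp - 65).clip(lower=0),
})

# A high pairwise correlation confirms these columns carry redundant information
print(df.corr().round(2))
```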
# Checking linear regression fit using statsmodels Linear regression
m = ols('SDGE ~ C(year) + C(month) + C(hour) + C(season) + C(non_working) + \
HourlyDryBulbTemperature + cum_AC_kW', sdge).fit()
print(m.summary())
plot_predvstrue_reg(m.fittedvalues, sdge.SDGE, model_name = 'Statsmodel Simple Linear Regression')
# Keeping only the necessary columns; day columns were removed because there is 0 to none monthly seasonality in the data
sdge_lin.drop(['Date','STATION', 'AC_kW', 'DailyCoolingDegreeDays','DailyHeatingDegreeDays',
'weekday_Monday', 'weekday_Saturday', 'weekday_Sunday', 'weekday_Thursday',
'weekday_Tuesday', 'weekday_Wednesday', 'holiday_1', 'day_2', 'day_3', 'day_4', 'day_5', 'day_6',
'day_7', 'day_8', 'day_9', 'day_10', 'day_11', 'day_12', 'day_13',
'day_14', 'day_15', 'day_16', 'day_17', 'day_18', 'day_19', 'day_20',
'day_21', 'day_22', 'day_23', 'day_24', 'day_25', 'day_26', 'day_27',
'day_28', 'day_29', 'day_30', 'day_31'], axis = 1, inplace = True)
sdge_lin.columns
# Creating the train and test data
# A test_size of 0.15 gives us a training set of 4 years and 3 months and a test set of 9 months (the test set includes
#...winter and summer seasons).
cols_to_transform = ['HourlyDryBulbTemperature', 'cum_AC_kW'] # other columns are binary values
X_train, X_test, y_train, y_test = train_test(sdge_lin, test_size = 0.15, scale = True, cols_to_transform=cols_to_transform)
# This creates a LinearRegression object
lm = LinearRegression()
lm
# Fitting the linear regression model
lm.fit(X_train, y_train)
# Plotting the coefficients to check the importance of each coefficient
# Plot the coefficients
_ = plt.figure(figsize = (16, 7))
_ = plt.plot(range(len(X_train.columns)), lm.coef_)
_ = plt.xticks(range(len(X_train.columns)), X_train.columns.values, rotation = 90)
_ = plt.margins(0.02)
_ = plt.axhline(0, linewidth = 0.5, color = 'r')
_ = plt.title('sklearn Simple linear regression coefficients')
_ = plt.ylabel('lm_coeff')
_ = plt.xlabel('Features')
#X_train.describe().T[['min', 'max']]
# Plotting lm.predict(X) vs observed energy consumption
plot_predvstrue_reg(lm.predict(X_test), y_test, model_name = 'sklearn Simple linear regression')
error_metrics(lm.predict(X_train), y_train, model_name = 'Simple linear regression with scaling', test = False)
# on test set
error_metrics(lm.predict(X_test), y_test, model_name = 'Simple linear regression with scaling', test = True)
# Plotting the predicted values with the original time series (test set)
plot_ts_pred(y_test, lm.predict(X_test), model_name='Simple linear regression with scaling',
og_ts_opacity = 0.5, pred_ts_opacity = 0.5)
I had mentioned in the introduction that all the models will be compared to a baseline persistent model, which is simply a repetition of the previous year's energy consumption values.
# Calculating the errors for persistent forecast (where we simply repeat the values from last year)
#y_test.index.year
_ = error_metrics(sdge_lin.loc[X_test.index.shift(-8760, freq='H'), 'SDGE'],
y_test, model_name='Baseline Persistent forecast, repeat last year', test=True)
plot_ts_pred(y_test, sdge_lin.loc[X_test.index.shift(-8760, freq='H'), 'SDGE'].to_numpy(),
model_name='Baseline Persistent forecast, repeat last year')
# In the energy forecasting domain, it is even more important to get the daily max demand right. So, calculating the ...
#...error on that too
# Resampling both the y_test and predictions at a 24 hours period and using the max as the aggregate function
_ = error_metrics(sdge_lin.loc[X_test.index.shift(-8760, freq='H'), 'SDGE'].resample('24h').max(),
y_test.resample('24h').max(),
model_name='Baseline Persistent forecast, repeat last year; daily MAX', test=True)
# Trying regularization: Ridge regression
from sklearn.linear_model import Ridge
# Instantiate a ridge regressor: ridge
ridge = Ridge(alpha = 0.2, normalize = True) #selecting alpha=0.2 here.
# Tried different values from 0.1 to 0.8, results don't change much
# Fit the regressor to the data
ridge.fit(X_train, y_train)
# Compute and print the coefficients
ridge_coef = ridge.coef_
#print(ridge_coef)
# Plot the coefficients
_ = plt.figure(figsize = (16, 7))
_ = plt.plot(range(len(X_train.columns)), ridge_coef)
_ = plt.xticks(range(len(X_train.columns)), X_train.columns.values, rotation = 90)
_ = plt.margins(0.02)
_ = plt.axhline(0, linewidth = 0.5, color = 'r')
_ = plt.title('Significance of features as per Ridge regularization')
_ = plt.ylabel('Ridge coeff')
_ = plt.xlabel('Features')
################################################################################
# Plotting the residuals
residuals = (y_test - ridge.predict(X_test))
_ = plt.figure(figsize=(7,7))
_ = plt.scatter(ridge.predict(X_test) , residuals, alpha = 0.5)
_ = plt.xlabel("Model predicted energy values")
_ = plt.ylabel("Residuals")
_ = plt.title("Fitted values versus Residuals for Ridge regression")
print('Ridge regression on training set')
error_metrics(ridge.predict(X_train), y_train, model_name = 'Ridge regression with scaling', test = False)
print('\nRidge regression on test set')
error_metrics(ridge.predict(X_test), y_test, model_name = 'Ridge regression with scaling', test = True)
# Plotting the observed test energy and predicted energy data on the same graph as line plots
plot_ts_pred(y_test, ridge.predict(X_test), model_name='Ridge regression', og_ts_opacity = 0.5, pred_ts_opacity = 0.5)
In the above models we used one-hot encoded variables like hour_1, hour_2, ..., month_2, month_3, etc. This results in a loss of information because the model assumes that the 23rd hour is far away from the 0th hour (and likewise month_12 from month_1), which is not the case: the time series is periodic, and the 0th hour is as close to the 23rd hour as it is to the 1st hour. To avoid this loss of information and to lower the number of predictor variables, we'll try some feature engineering on the X space.
We'll also omit the month and weekday columns because the season column captures the month information anyway. So the number of X variables used in the models below will be considerably lower than in the models above.
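One common alternative worth noting (a sketch only; below the notebook instead groups hours into times of day) is to encode cyclic features with sine/cosine pairs, so that hour 23 and hour 0 end up adjacent:

```python
import numpy as np

hours = np.arange(24)
# Map each hour onto the unit circle; the encoding wraps around at midnight
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)

# Distance in (sin, cos) space: hour 23 vs hour 0, and hour 23 vs hour 12
d_23_0 = np.hypot(hour_sin[23] - hour_sin[0], hour_cos[23] - hour_cos[0])
d_23_12 = np.hypot(hour_sin[23] - hour_sin[12], hour_cos[23] - hour_cos[12])
print(round(d_23_0, 3), round(d_23_12, 3))  # 23:00 is close to 00:00, far from 12:00
```

With one-hot dummies every pair of distinct hours is equally far apart; the cyclic encoding preserves the clock geometry with just two columns.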
# Creating a new dataframe similar to the first one
sdge1 = pd.read_csv('hourly1418_energy_temp_PV.csv', index_col = 'Dates', parse_dates=['Dates', 'Date'])
sdge1.head()
# Dividing the hours into 4 groups-> night, morning, afternoon, evening
hour_dict = {'morning': list(np.arange(7,13)),'afternoon': list(np.arange(13,16)), 'evening': list(np.arange(16,22)),
'night': [22, 23, 0, 1, 2, 3, 4, 5, 6]}
hour_dict
def time_of_day(x):
if x in hour_dict['morning']:
return 'morning'
elif x in hour_dict['afternoon']:
return 'afternoon'
elif x in hour_dict['evening']:
return 'evening'
else:
return 'night'
sdge1['time_of_day'] = sdge1['hour'].apply(time_of_day)
sdge1.head()
# creating categorical columns for linear regression
cat_cols1 = ['month', 'day', 'hour', 'weekday', 'season', 'holiday', 'non_working', 'time_of_day']
#year is not made categorical above, so the model can capture the decreasing energy trend across the years
for col in cat_cols1:
sdge1[col] = sdge1[col].astype('category')
# Columns to use for regression
cols_use = ['SDGE', 'year', 'time_of_day', 'season', 'non_working', 'HourlyDryBulbTemperature', 'cum_AC_kW']
#['Date','STATION','DailyCoolingDegreeDays','DailyHeatingDegreeDays', 'AC_kW', \
# 'weekday_Monday', 'weekday_Saturday', 'weekday_Sunday', 'weekday_Thursday',\
# 'weekday_Tuesday', 'weekday_Wednesday', 'holiday_1']
sdge1_lin = pd.get_dummies(sdge1[cols_use], drop_first = True)
sdge1_lin.head()
# Creating the train and test data
# Since we had seen in the above models that the energy consumption decreases over the years from 14-18, the year column was
#..not categorized here and was instead used as a regular regressor term
cols_to_transform = ['HourlyDryBulbTemperature', 'cum_AC_kW', 'year']
X_train, X_test, y_train, y_test = train_test(sdge1_lin, test_size = 0.15, scale = True, cols_to_transform=cols_to_transform,
include_test_scale=False)
# Trying elastic net regression
# Import necessary modules
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
def trend_model(data, cols_to_transform, l1_space, alpha_space, cols_use = 0, scale = True, test_size = 0.15,
include_test_scale=False):
"""
Tuning, fitting and predicting with an Elastic net regression model.
data: time series dataframe including X and y variables
col_use: columns including the y variable to be used from the data
cols_to_transform: columns to be scaled using StandardScaler if scale = True
l1_space: potential values to try for the l1_ratio parameter of the elastic net regression
include_test_scale: If True then the StandardScaler will be fit on the entire dataset instead of just the training set
A note about l1_ratio: The ElasticNet mixing parameter, with 0 <= l1_ratio <= 1.
For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty.
For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2.
"""
# Creating the train test split
if cols_use != 0:
df = data[cols_use]
else:
df = data
X_train, X_test, y_train, y_test = train_test(df, test_size = test_size,
scale = scale, cols_to_transform=cols_to_transform,
include_test_scale=include_test_scale)
# Create the hyperparameter grid
#l1_space = np.linspace(0, 1, 50)
param_grid = {'l1_ratio': l1_space, 'alpha': alpha_space}
# Instantiate the ElasticNet regressor: elastic_net
elastic_net = ElasticNet()
# for time-series cross-validation set 5 folds
tscv = TimeSeriesSplit(n_splits=5)
# Setup the GridSearchCV object: gm_cv ...trying 5 fold cross validation
gm_cv = GridSearchCV(elastic_net, param_grid, cv = tscv)
# Fit it to the training data
gm_cv.fit(X_train, y_train)
# Predict on the test set and compute metrics
y_pred = gm_cv.predict(X_test)
r2 = gm_cv.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)
print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
print("Tuned ElasticNet RMSE: {}".format(np.sqrt(mse)))
# fitting the elastic net again using the best parameters (both l1_ratio and alpha) found above
elastic_net_opt = ElasticNet(l1_ratio = gm_cv.best_params_['l1_ratio'], alpha = gm_cv.best_params_['alpha'])
elastic_net_opt.fit(X_train, y_train)
# Plot the coefficients
_ = plt.figure(figsize = (15, 7))
_ = plt.plot(range(len(X_train.columns)), elastic_net_opt.coef_)
_ = plt.xticks(range(len(X_train.columns)), X_train.columns.values, rotation = 90)
_ = plt.margins(0.02)
_ = plt.axhline(0, linewidth = 0.5, color = 'r')
_ = plt.title('Significance of features as per Elastic net regularization')
_ = plt.ylabel('Elastic net coeff')
_ = plt.xlabel('Features')
# Plotting y_true vs predicted
_ = plt.figure(figsize = (5,5))
_ = plot_predvstrue_reg(elastic_net_opt.predict(X_test), y_test, model_name = 'Elastic net optimal linear regression')
# returns the train and test X and y sets and also the optimal model
return X_train, X_test, y_train, y_test, elastic_net_opt
data = sdge1_lin
cols_to_transform = ['HourlyDryBulbTemperature', 'cum_AC_kW', 'year']
l1_space = np.linspace(0, 1, 30)
alpha_space = np.logspace(-2, 0, 30)
# Fitting, tuning and predicting using the best elastic net regression model
import warnings
warnings.filterwarnings('ignore')
X_train, X_test, y_train, y_test, elastic_net_opt = trend_model(data=data, cols_to_transform=cols_to_transform,
l1_space=l1_space, alpha_space=alpha_space,
scale = True, test_size = 0.15, include_test_scale=False)
# Plotting the observed test energy and predicted energy data on the same graph as line plots
plot_ts_pred(y_test, elastic_net_opt.predict(X_test), model_name='Optimal Elastic net regression', \
og_ts_opacity = 0.5, pred_ts_opacity = 0.5)
# printing the error metrics
print('Elastic net regression on training set')
error_metrics(elastic_net_opt.predict(X_train), y_train, model_name = 'Tuned elastic net regression with reduced hour space',
test = False)
print('\nElastic net regression on test set')
error_metrics(elastic_net_opt.predict(X_test), y_test, model_name = 'Tuned elastic net regression with reduced hour space',
test = True)
from sklearn.ensemble import RandomForestRegressor
# Tuning Random forest
# n_estimators = number of trees in the forest
# max_features = max number of features considered for splitting a node
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(10, 200, 10, endpoint=True)]
max_features = ['auto', 'sqrt']
max_depth = list(range(1,6))
# Create the random grid
random_grid = {'n_estimators': n_estimators, 'max_features': max_features, 'max_depth':max_depth}
print(random_grid)
#import randomsearchcv
from sklearn.model_selection import RandomizedSearchCV
# First create the base model to tune
rf = RandomForestRegressor()
# Creating a time series split as discussed in the Introduction
tscv = TimeSeriesSplit(n_splits=5)
# Random search of parameters
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid,
cv = tscv, verbose=2, random_state = 42, n_jobs = -1)
# Fit the random search model
rf_random.fit(X_train, y_train)
rf_random.best_params_
#rf.fit(X_train, y_train)
rf_random.score(X_train, y_train)
rf_random.score(X_test, y_test) # expect to see overfitting here
# Plotting the observed test energy and predicted energy data on the same graph as line plots
plot_ts_pred(y_test, rf_random.predict(X_test), model_name='Tuned Random forest with reduced hour space', \
og_ts_opacity = 0.5, pred_ts_opacity = 0.5)
# Random forest error metrics
print('Tuned Random forest errors on training set')
error_metrics(rf_random.predict(X_train), y_train, model_name = 'Tuned Random forest with reduced hour space', test = False)
print('\nTuned Random forest errors on test set')
error_metrics(rf_random.predict(X_test), y_test, model_name = 'Tuned Random forest with reduced hour space', test = True)
What makes a time series different from other datasets is that the current y value depends on the previous N values in time, according to how strongly the data correlates with its lagged versions. For example, today's outside temperature may depend on yesterday's temperature, or perhaps on the last 5 days of temperature values. Energy consumption can likewise be expected to depend on its previous lagged values, because the consumption of a region shouldn't change much barring unexpected or unfortunate events. So we will add the lagged values of energy consumption as X parameters and check whether we can predict better using the past values.
sdge1_lin.head(2)
# Adding max 24 lags; lag1 is the value of the energy consumption in the previous hour, lag2 is the value of energy consumption..
#..2 hours before the current value and so on.
# Creating the lag variables
for i in range(24):
sdge1_lin['lag'+str(i+1)] = sdge1_lin['SDGE'].shift(i+1)
# Since the first 24 values won't have any 24th lag, they will be NaN. So dropping the NaNs
lag_sdge = sdge1_lin.dropna()
lag_sdge.head(2)
# Fitting Elastic Net Regression model using the lag variables as the additional inputs ...
#...(extending the reduced space model)
# Not tuning the elastic net this time because we will see that there is hardly any scope for improvement over the excellent
#..results we get
cols_to_transform = ['HourlyDryBulbTemperature', 'cum_AC_kW', 'year']
# Adding the energy consumption lags to the columns to transform
list_lags = ['lag'+str(i+1) for i in range(24)]
cols_to_transform.extend(list_lags)
X_train_lag, X_test_lag, y_train_lag, y_test_lag = train_test(lag_sdge, \
test_size = 0.15, scale = True, \
cols_to_transform=cols_to_transform)
elastic_net_lag = ElasticNet(l1_ratio = 1, alpha=0.2)
elastic_net_lag.fit(X_train_lag, y_train_lag)
# Plot the coefficients
_ = plt.figure(figsize = (16, 7))
_ = plt.plot(range(len(X_train_lag.columns)), elastic_net_lag.coef_)
_ = plt.xticks(range(len(X_train_lag.columns)), X_train_lag.columns.values, rotation = 90)
_ = plt.margins(0.02)
_ = plt.axhline(0, linewidth = 0.5, color = 'r')
_ = plt.title('Significance of features as per Elastic net regularization with scaling and including lags')
_ = plt.ylabel('Linear reg coeff')
_ = plt.xlabel('Features')
X_train_lag.columns.values
#rflag = RandomForestRegressor()
#rflag.fit()
# First create the base model to tune
rf = RandomForestRegressor()
tscv = TimeSeriesSplit(n_splits=5)
# Random search of parameters
rflag = RandomizedSearchCV(estimator = rf, param_distributions = random_grid,
cv = tscv, verbose=2, random_state = 42, n_jobs = -1)
# Fit the random search model
rflag.fit(X_train_lag, y_train_lag)
rflag.best_params_
#rf.fit(X_train, y_train)
# comparison of scores using elastic net and RF on the lag dataset
print('Random forest errors with lags on training')
error_metrics(rflag.predict(X_train_lag), y_train_lag, model_name = 'Random forest with all lags', test=False)
print('\nRandom forest errors with lags on test')
error_metrics(rflag.predict(X_test_lag), y_test_lag, model_name = 'Random forest with all lags', test=True)
print('\nElastic net errors with lags on train')
error_metrics(elastic_net_lag.predict(X_train_lag), y_train_lag, model_name = 'Elastic net with all lags', test=False)
print('\nElastic net errors with lags on test')
error_metrics(elastic_net_lag.predict(X_test_lag), y_test_lag, model_name = 'Elastic net with all lags', test=True)
plot_predvstrue_reg(elastic_net_lag.predict(X_test_lag), y_test_lag,
model_name = 'Elastic Net Regression with all lags test set')
plot_predvstrue_reg(rflag.predict(X_test_lag), y_test_lag,
model_name = 'Random Forest Regression with all lags test set')
# Plotting the observed test values, and predicted test values using the above Elastic net and Random forest models using lags
#..up to last 24 hour values
fig = go.Figure()
fig.add_trace(go.Scatter(x = X_test_lag.index, y = np.array(y_test_lag), name = "observed",
line_color = 'deepskyblue', opacity = 0.5))
fig.add_trace(go.Scatter(x = X_test_lag.index, y = rflag.predict(X_test_lag), name = "Random forest predictions",
line_color = 'lightslategrey', opacity = 0.5))
fig.add_trace(go.Scatter(x = X_test_lag.index, y = elastic_net_lag.predict(X_test_lag), name = "Elastic net predictions",
line_color = 'lightcoral', opacity = 0.5))
fig.update_layout(title_text = 'Observed test set vs Predicted energy using Random forest & elastic net \
reg using lags upto 24h on test set',
xaxis_rangeslider_visible = True)
fig.show()
# Creating a simple time series dataframe
ts_sdge = pd.DataFrame(sdge1_lin["SDGE"], columns=['SDGE'])
ts_sdge.tail()
# Plotting our original time series
plot_timeseries(ts_sdge['SDGE'], title = 'Original data set')
from statsmodels.graphics import tsaplots
import statsmodels.api as sm
# Run seasonal decompose
decomp = sm.tsa.seasonal_decompose(ts_sdge['SDGE'], freq=24*365) # capturing the yearly seasonal component; i.e. for example
#... every July the energy consumption is high and then it gets lower during the winter months.
#...So, a periodicity of 24hours*365 was used here.
print(decomp.seasonal.head()) # checking the seasonal component
_ = decomp.plot()
ts_sdge['seasonal'] = decomp.seasonal
# Plotting the seasonal component only
plot_timeseries(ts_sdge['seasonal'], title = 'Seasonal component')
ts_sdge['trend'] = decomp.trend
# Plotting the trend (the starting 6 months and the end 6 months of data is NaN because we had set the periodicity as 24*365,
#... so the trend data will have 8760 less values)
ts_sdge['trend'].dropna().plot()
# Plotting the yearly rolling average of the time series to check trend (mean consumption)
_ = sdge['SDGE'].rolling(window = 24*30*12).mean().plot(figsize=(12,5))
_ = plt.title('Checking trend in the data by averaging yearly values')
# Plotting the yearly rolling MAX of the time series to check trend
_ = sdge['SDGE'].rolling(window = 24*30*12).max().plot(figsize=(12,5))
_ = plt.title('Checking trend in the data by taking the MAX of yearly values')
Before we move forward we need to discuss an important property of time series data: stationarity.
From the above definition of stationarity it is clear that our energy consumption dataset is not stationary: it has a trend (a changing mean) and also seasonality, which means the covariance function does depend on time. The simplest way to identify stationarity (or its absence) is to inspect the data visually, as we did above by decomposing it, but the most common way of testing a dataset for stationarity is the augmented Dickey-Fuller test, which tests for a unit root. Basically, if the p-value of the test is small enough (say less than 0.05, giving us 95% confidence), then we reject the null hypothesis that our data is non-stationary and treat it as stationary.
And the most common way of dealing with non-stationarity is differencing the dataset. Let's see some examples below.
# Differencing the data once
decomp_diff1 = sm.tsa.seasonal_decompose(ts_sdge['SDGE'].diff().dropna(), freq=24*365)
_ = decomp_diff1.plot()
# Differencing the data with periodicity of 24 hours in addition to the original single differencing
decomp_diff24 = sm.tsa.seasonal_decompose(ts_sdge['SDGE'].diff().dropna().diff(24).dropna(), freq=24*365)
_ = decomp_diff24.plot()
from statsmodels.tsa.stattools import adfuller
def run_adfuller(ts):
    result = adfuller(ts)
    # Print test statistic
    print("t-stat", result[0])
    # Print p-value
    print("p-value", result[1])
    # Print number of lags used
    print("#lags used", result[2])
    # Print critical values
    print("critical values", result[4])
print("for no differencing\n")
run_adfuller(ts_sdge['SDGE'])
print("\nfor single differencing\n")
run_adfuller(ts_sdge['SDGE'].diff().dropna())
print("\nfor differenced data set over lags 24 after single differencing\n")
run_adfuller(ts_sdge['SDGE'].diff().dropna().diff(24).dropna())
Simple techniques like a rolling average of past values are often useful for visualizing the trend in a dataset. The simplest model for a time series is a persistence model, where simply $y_t = y_{t-1}$. Time series models are often compared against this naive model to test their performance. A more advanced smoothing technique is called exponential smoothing, and it is given as:
$$\hat{y}_{t} = \alpha \cdot y_t + (1-\alpha) \cdot \hat y_{t-1}$$
Here the model value is a weighted average between the current true value and the previous model values. The weight $\alpha$ is called the smoothing factor. It defines how quickly we "forget" the last available true observation: the smaller $\alpha$ is, the more influence the previous observations have and the smoother the series is.
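To make the recursion concrete, it can be reproduced by hand and checked against pandas. This is a minimal sketch with illustrative numbers only; note that pandas matches the recursion exactly when `adjust=False`, while the `ewm(alpha=alpha)` call used below defaults to `adjust=True`, a debiased variant that converges to the same recursion.

```python
import numpy as np
import pandas as pd

def ewma_recursive(y, alpha):
    """Exponential smoothing via y_hat[t] = alpha*y[t] + (1-alpha)*y_hat[t-1]."""
    y = np.asarray(y, dtype=float)
    y_hat = np.empty_like(y)
    y_hat[0] = y[0]  # initialize with the first observation
    for t in range(1, len(y)):
        y_hat[t] = alpha * y[t] + (1 - alpha) * y_hat[t - 1]
    return y_hat

y = pd.Series([10.0, 12.0, 11.0, 13.0, 12.5])  # illustrative values, not SDGE data
manual = ewma_recursive(y, alpha=0.3)
pandas_ewm = y.ewm(alpha=0.3, adjust=False).mean().to_numpy()
assert np.allclose(manual, pandas_ewm)  # recursion matches pandas with adjust=False
```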
# Defining a function to plot exponentially smoothed data
def plot_ewma(ts, alpha):
    expw_ma = ts.ewm(alpha=alpha).mean()
    plot_ts_pred(ts, expw_ma, model_name='Exponentially smoothed data with alpha = {}'.format(alpha),
                 og_ts_opacity=0.5, pred_ts_opacity=0.5)
# Plotting the exponentially smoothed energy consumption data with alpha=0.3
plot_ewma(ts_sdge['SDGE'], 0.3)
The EWMA does a good job of modeling the future values, but it effectively requires a number of lag values depending on $\alpha$, and it is a very simplistic model for our energy consumption data.
There are more advanced models like triple exponential smoothing, a.k.a. the Holt-Winters model, but I am skipping that model here because we can try better-suited models like FB Prophet and ARIMA on our data (which seems to have a trend and multiple seasonalities).
The correlation of time series observations with values of the same series at previous times is called serial correlation, or autocorrelation (ACF). It is used to determine the moving average (MA, or q) term of ARIMA(p,d,q) models.
A partial autocorrelation (PACF) is a summary of the relationship between an observation in a time series and observations at prior time steps, with the relationships of the intervening observations removed. It is used to determine the autoregressive (AR, or p) term of ARIMA(p,d,q) models.
And we have observed till now that our energy consumption time series has a strong correlation with its past day value (lag of 24 hours) and also its past value 365*24 hours ago. In addition a weekly seasonality is also observed in the energy consumption. Let's plot the ACF and PACF plots for our energy time series data.
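Before plotting, it helps to recall what the ACF actually computes: the correlation of the series with a k-step lagged copy of itself. A minimal sketch on a hypothetical, purely daily-periodic signal (illustrative only, not the SDGE data):

```python
import numpy as np

def autocorr(ts, k):
    """Sample autocorrelation: correlation of the series with its own k-step lag."""
    ts = np.asarray(ts, dtype=float)
    return np.corrcoef(ts[:-k], ts[k:])[0, 1]

# A signal that repeats exactly every 24 samples, like a daily load shape
t = np.arange(24 * 30)
ts = np.sin(2 * np.pi * t / 24)

assert autocorr(ts, 24) > 0.99   # one full period apart: strongly correlated
assert autocorr(ts, 12) < -0.99  # half a period apart: strongly anti-correlated
```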
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
def acf_pacf_plots(ts, lags, figsize=(12, 8)):
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=figsize)
    # Plot the ACF of ts
    _ = plot_acf(ts, lags=lags, zero=False, ax=ax1, alpha=0.05)
    # Plot the PACF of ts
    _ = plot_pacf(ts, lags=lags, zero=False, ax=ax2, alpha=0.05)
# Create the acf and pacf plots
dfacf = ts_sdge['SDGE']
lags = 50
acf_pacf_plots(dfacf, lags = lags, figsize = (12,8))
The seasonal period of 24 hours can be easily observed from the plots above.
# Differencing the data this time to remove the trend and seasonality
dfacf = ts_sdge['SDGE']
dfacf = dfacf.diff().dropna()
dfacf = dfacf.diff(24).dropna()
dfacf = dfacf.diff(24*365).dropna()
lags = 100
acf_pacf_plots(dfacf, lags = lags, figsize = (12,8))
Even after differencing and seasonal differencing the data, the ACF and PACF plots are not very informative; they do not tell us which AR and MA terms would work well for a SARIMAX model (S stands for seasonality and X for exogenous variables).
This is a common pitfall of ACF/PACF plots and SARIMAX models: they cannot handle multiple seasonalities. And our energy consumption dataset has multiple seasonalities: daily (24 hours), weekly (168 hours) and yearly (24*365.25 hours). We'll have to deal with the multiple seasonalities separately to be able to use a SARIMAX model on our energy consumption time series.
There are two interesting time series forecasting methods, BATS and TBATS, that are capable of modeling time series with multiple seasonalities. Unfortunately their capabilities do not come for free: the method is very generic, and under the hood it builds and evaluates many candidate models, which makes the computation slow. A SARIMAX model with Fourier series handling the multiple seasonalities can perform as well as TBATS, so we will opt for the simpler model here, i.e. SARIMAX. We will need to create some extra features to model the multiple seasonalities.
import warnings
warnings.filterwarnings('ignore')
# We can add multiple Fourier series with different k terms in the (2*k*pi) argument, such as k=1,2,3,...etc. To generalize
# the problem, we could have chosen an optimal k value for each season by trying several k values and picking the ones
# giving the lowest AIC.
def add_fourier_terms(df, year_k, week_k, day_k):
    """
    df: dataframe to add the Fourier terms to
    year_k: the number of Fourier terms for the yearly period. The model will be fit on 2*year_k terms
            (one sine and one cosine term per k)
    week_k: same as year_k but for the weekly period
    day_k: same as year_k but for the daily period
    """
    for k in range(1, year_k + 1):
        # year has a period of 365.25 days, including the leap year
        df['year_sin' + str(k)] = np.sin(2 * k * np.pi * df.index.dayofyear / 365.25)
        df['year_cos' + str(k)] = np.cos(2 * k * np.pi * df.index.dayofyear / 365.25)
    for k in range(1, week_k + 1):
        # week has a period of 7 days
        df['week_sin' + str(k)] = np.sin(2 * k * np.pi * df.index.dayofweek / 7)
        df['week_cos' + str(k)] = np.cos(2 * k * np.pi * df.index.dayofweek / 7)
    for k in range(1, day_k + 1):
        # day has a period of 24 hours
        df['hour_sin' + str(k)] = np.sin(2 * k * np.pi * df.index.hour / 24)
        df['hour_cos' + str(k)] = np.cos(2 * k * np.pi * df.index.hour / 24)
warnings.filterwarnings('ignore')
# As noted above, the k terms for the yearly, weekly and daily seasonalities could be chosen by optimizing the AIC,
# but after some research on energy consumption time series, k=5 was chosen for each seasonality
add_fourier_terms(lag_sdge, year_k= 5, week_k=5 , day_k=5)
# Visualizing the new variables on hour seasonality
pd.plotting.register_matplotlib_converters()
_ = (1-lag_sdge.loc['01-02-2014':'01-02-2014', [col for col in lag_sdge if col.startswith('hour')]]).sum(axis = 1).plot()
# Visualizing the new variables on year seasonality
_ = (1-lag_sdge.loc['01-01-2014':'01-01-2015', [col for col in lag_sdge if col.startswith('year')]]).sum(axis = 1).plot()
# Visualizing the new variables on week seasonality
_ = (1-lag_sdge.loc['01-01-2014':'01-09-2014', [col for col in lag_sdge if col.startswith('week')]]).sum(axis = 1).plot()
lag_sdge.columns
# Choosing a subset of the above dataframe; removing the lags and the hour bins
sdgecyc = lag_sdge.drop([col for col in lag_sdge if
col.startswith('time') or col.startswith('season') or col.startswith('lag')], axis=1)
from statsmodels.tsa.statespace.sarimax import SARIMAX
import pmdarima as pm
# defining the exogenous variables for the SARIMAX model
#sdgecyc.drop('SDGE', axis=1)
sdgecyc.columns
# Creating the training and test datasets
# check https://medium.com/@josemarcialportilla/using-python-and-auto-arima-to-forecast-seasonal-time-series-90877adff03c
# not adding year variable here because the model will use the most recent lag energy consumption values
cols_to_transform = ['HourlyDryBulbTemperature', 'cum_AC_kW']
X_train_lag, X_test_lag, y_train_lag, y_test_lag = train_test(sdgecyc,
test_size = 0.15, scale = True,
cols_to_transform=cols_to_transform)
# Since ARIMA model uses the past lag y values, scaling the energy values as well.
#i.e. fit the scaler on y_train and transform it and also transform y_test using the same scaler
scaler1 = StandardScaler()
y_train_lag = pd.DataFrame(scaler1.fit_transform(y_train_lag.values.reshape(-1,1)), index = y_train_lag.index,
columns = ['SDGE'])
# y_test_lag = scaler1.transform(y_test_lag)
# Got the optimal SARIMAX order from running the above auto_arima function. It took more than 2 hours to get the optimal...
#...solution. So, just using the final values here
model_opt = SARIMAX(y_train_lag, order=(2,1,1), seasonal_order=(1, 0, 1, 24), exog = X_train_lag, trend='c')
results = model_opt.fit()
results.summary()
#X_train_lag.columns
# plotting the residuals and checking if they meet the i.i.d requirements
_ = results.plot_diagnostics(figsize=(12, 7))
# In-sample dynamic predictions over the last week of the training set.
# Setting dynamic = True so that the model won't use the actual energy values for prediction. Instead, the model uses
# the lag and moving average terms of the already-forecasted energy values, so we will see the errors
# (confidence interval) increasing with each forecast step.
pred = results.get_prediction(start=X_train_lag.index[-24*7], end=X_train_lag.index[-1],
dynamic=True, exog=X_train_lag.iloc[-24*7:, :])
pred_ci = pred.conf_int()
pred1 = scaler1.inverse_transform(pred.predicted_mean)
pred_ci1 = scaler1.inverse_transform(pred.conf_int())
y_actual_train = np.squeeze(scaler1.inverse_transform(y_train_lag))
y_actual_train = pd.Series(y_actual_train, index = X_train_lag.index )
pred1 = pd.Series(pred1, index = X_train_lag.iloc[-24*7:, :].index )
pred_ci1 = pd.DataFrame(pred_ci1, index = pred1.index, columns = ['lower SDGE', 'upper SDGE'])
lower_limits = pred_ci1.loc[:,'lower SDGE']
upper_limits = pred_ci1.loc[:,'upper SDGE']
# Error on training set for 1 week ahead forecast
error_metrics(pred1, y_actual_train.iloc[-24*7:], 'SARIMAX(2,1,1)x(1,0,1,24) with Fourier terms 1 week',
test=False)
#plot_ts_pred(y_actual_train.iloc[-24*7:], pred1 , model_name='SARIMA (0,1,2)x(1,0,1,24)',
# og_ts_opacity = 0.5, pred_ts_opacity = 0.5)
# Predictions on test set. Predicting only the 1st week of the test set
pred = results.get_forecast(steps = 24*7, exog = X_test_lag.iloc[:24*7, :])
pred_ci = pred.conf_int()
pred2 = scaler1.inverse_transform(pred.predicted_mean)
pred_ci2 = scaler1.inverse_transform(pred.conf_int())
y_actual = y_test_lag.iloc[:24*7]
pred2 = pd.Series(pred2, index = X_test_lag.iloc[:24*7, :].index )
pred_ci2 = pd.DataFrame(pred_ci2, index = pred2.index, columns = ['lower SDGE', 'upper SDGE'])
lower_limits = pred_ci2.loc[:,'lower SDGE']
upper_limits = pred_ci2.loc[:,'upper SDGE']
# plot the predictions
plt.figure(figsize = (15,7))
_ = plt.plot(y_actual.index, y_actual, label='observed')
# plot your mean predictions
_ = plt.plot(pred2.index, pred2.values, color='r', label='forecast')
# shade the area between your confidence limits
_ = plt.fill_between(lower_limits.index, lower_limits, upper_limits, color='pink')
# set labels, legends and show plot
_ = plt.xlabel('Date')
_ = plt.ylabel('Energy consumption MWH')
_ = plt.legend()
_ = plt.title('Testing the SARIMAX (2,1,1)x(1,0,1,24) model on the testing set: predicting the 1st week')
# Checking if the SARIMA model captures the long term seasonality
#plot_ts_pred(y_actual, pred1 , model_name='SARIMA (0,1,2)x(1,0,1,24)',
# og_ts_opacity = 0.5, pred_ts_opacity = 0.5)
# It fails terribly to capture the long term seasonality
error_metrics(pred2, y_actual, 'SARIMAX(2,1,1)x(1,0,1,24) with Fourier terms 1 week', test=True)
sdgecyc.head(2)
Traditional approaches like SARIMA models often require manual data pre-processing steps (e.g. differencing to make the data stationary), and it's also hard to explain to people without forecasting expertise why these models produce the prediction results they do. In addition, these models don't make it easy to incorporate additional domain knowledge to improve precision. To address these problems, Facebook researchers released FB Prophet, a time series forecasting tool supporting both Python and R.
FBProphet provides a decomposition regression model that is extendable and configurable with interpretable parameters. Prophet frames the forecasting problem as a curve-fitting exercise rather than looking explicitly at the time based dependence of each observation within a time series. Similar to SARIMAX, we can add extra regressor terms like temperature data to the model as well.
At its core, Prophet is an additive model with the following components: $$y(t) = g(t) + s(t) + h(t) + ϵₜ$$
$g(t)$ models trend, which describes long-term increase or decrease in the data. Prophet incorporates two trend models, a saturating growth model and a piecewise linear model, depending on the type of forecasting problem.
$s(t)$ models seasonality with Fourier series, which describes how data is affected by seasonal factors such as the time of the year
$h(t)$ models the effects of holidays or large events that impact business time series (e.g. new product launch, Black Friday, Superbowl, etc.)
$ϵₜ$ represents an irreducible error term
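To make the additive structure concrete, the four components can be sketched on synthetic data (all numbers below are illustrative, not fitted values from the SDGE series). Prophet estimates $g$, $s$ and $h$ from the data; here they are simply constructed to show how the pieces add:

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(24 * 14, dtype=float)           # two weeks of hourly time steps

g = 2400 - 0.01 * t                           # g(t): slowly decreasing linear trend
s = 150 * np.sin(2 * np.pi * t / 24)          # s(t): daily seasonality (one Fourier pair)
h = np.where((t // 24) % 7 == 0, -100.0, 0.0) # h(t): a dip on one "holiday" day each week
eps = rng.normal(0, 10, size=t.shape)         # eps_t: irreducible noise

y = g + s + h + eps                           # y(t) = g(t) + s(t) + h(t) + eps_t
```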
# cols_to_transform = ['HourlyDryBulbTemperature', 'cum_AC_kW']
X_trainP, X_testP, y_trainP, y_testP = train_test\
(lag_sdge[['SDGE', 'HourlyDryBulbTemperature','cum_AC_kW', 'non_working_working']],
test_size=0.15,
scale=False, #True
#cols_to_transform=cols_to_transform,
include_test_scale=False)
Prophet requires the input data in a particular format (a `ds` datetime column and a `y` target column), so we prepare the data accordingly.
# preparing data for Prophet
def data_prophet(X_train, X_test, y_train, y_test):
    data_train = pd.merge(X_train, y_train, left_index=True, right_index=True)
    data_train = data_train.reset_index().rename(columns={'SDGE': 'y', 'Dates': 'ds'})
    data_test = pd.merge(X_test, y_test, left_index=True, right_index=True)
    data_test = data_test.reset_index().rename(columns={'SDGE': 'y', 'Dates': 'ds'})
    return data_train, data_test
data_train, data_test = data_prophet(X_trainP, X_testP, y_trainP, y_testP)
data_train.tail(3)
# Importing Prophet
from fbprophet import Prophet
# Initiating fbprophet model; set the uncertainty interval to 95% (the Prophet default is 80%)
prop = Prophet(growth='linear', interval_width = 0.95,
yearly_seasonality='auto',
weekly_seasonality='auto',
daily_seasonality='auto',
seasonality_mode='additive',
#seasonality_prior_scale = 15
)
# Adding seasonality
#prop.add_seasonality(name='daily', period = 1, fourier_order = 15, prior_scale=20)
#prop.add_seasonality(name='weekly', period = 7, fourier_order = 10, prior_scale=20)
#prop.add_seasonality(name='yearly', period = 365.25, fourier_order = 20, prior_scale=20)
# Adding regressors
prop.add_regressor('HourlyDryBulbTemperature', prior_scale=20, mode='additive', standardize=True)
prop.add_regressor('cum_AC_kW', prior_scale = 1, mode='additive', standardize=True)
prop.add_regressor('non_working_working', prior_scale=10, mode='additive', standardize='auto') # 'auto' => standardize
#...the regressor unless it is binary
# Fitting the model to training data
prop.fit(data_train)
# Creating the dataframe with datetime values to predict on (making predictions on train as well as the test set)
future_dates = prop.make_future_dataframe(periods=len(data_test), freq='H', include_history=True)
# Adding regressors
future_dates = pd.merge(future_dates, (data_train.append(data_test)).drop('y', axis=1), on = 'ds')
future_dates.tail(3)
# Predicting the future
forecast = prop.predict(future_dates)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
#pd.plotting.register_matplotlib_converters()
_ = prop.plot(forecast, uncertainty = True, xlabel = 'Dates', ylabel = 'Energy consumption MWH')
# Plotting the components of the results
_ = prop.plot_components(forecast)
#pd.plotting.register_matplotlib_converters()
plot_ts_pred(data_train.append(data_test).set_index('ds')['y'], forecast.set_index('ds')['yhat'], \
model_name='FB Prophet with auto seasonality', og_ts_opacity = 0.5, pred_ts_opacity = 0.5)
FB Prophet also provides an option for performing cross-validation on the training set. It needs an $initial$, $period$ and $horizon$ as inputs: $initial$ is the training subset the model is first trained on, $period$ is the number of time steps the training subset is extended forward by at each fold, and $horizon$ is the forecast window. This is similar to the time series split described in the Introduction.
from fbprophet.diagnostics import cross_validation
len(y_trainP)*2//3
Randomly checking the performance of FB Prophet on 1 week ahead forecast within the training set.
# Cross validating on the training set with an initial training period of 24820 hours (two-thirds of the training set)
# ..and a forecast horizon of 168 hours (1 week ahead)
df_cv = cross_validation(prop, initial='24820 hours', period='1200 hours', horizon = '168 hours')
df_cv.head(3)
error_metrics(df_cv['yhat'], df_cv['y'], 'FB Prophet with auto seasonality 1 week ahead', test=False)
error_metrics(forecast.iloc[-len(data_test['y']):, ]['yhat'], data_test['y'],
'FB Prophet with auto seasonality',
test=True)
sdgecyc.head(2)
# Fitting Elastic Net Regression model on the data with the fourier series
#sdgecyc['year'] = lag_sdge['year']
data = sdgecyc
cols_to_transform = ['HourlyDryBulbTemperature', 'cum_AC_kW', 'year']
l1_space = np.linspace(0, 1, 30)
alpha_space = np.logspace(-2, 0, 30)
X_trainF, X_testF, y_trainF, y_testF, elastic_net_opt_F = trend_model(data=data, cols_to_transform=cols_to_transform,
l1_space=l1_space, alpha_space=alpha_space,
scale = True, test_size = 0.15, include_test_scale=False)
random_grid['max_depth'] = [3,4,5,6,7,8]
random_grid
# First create the base model to tune
rf = RandomForestRegressor()
tscv = TimeSeriesSplit(n_splits=5)
# Random search of parameters
rfF = RandomizedSearchCV(estimator = rf, param_distributions = random_grid,
cv = tscv, verbose=2, random_state = 42, n_jobs = -1)
# Fit the random search model
rfF.fit(X_trainF, y_trainF)
rfF.best_params_
#rf.fit(X_train, y_train)
rfFourier = RandomForestRegressor(n_estimators= rfF.best_params_['n_estimators'],
max_features=rfF.best_params_['max_features'],
max_depth= rfF.best_params_['max_depth'], random_state = 42)
rfFourier.fit(X_trainF, y_trainF)
# comparison of scores using elastic net and RF
print("Training accuracy R2 using random forest is {}".format(rfF.score(X_trainF, y_trainF)))
print("Training accuracy R2 using Elastic net is {}".format(elastic_net_opt_F.score(X_trainF, y_trainF)))
print("\nTest accuracy R2 using random forest is {}".format(rfF.score(X_testF, y_testF)))
print("Test accuracy R2 using Elastic net is {}".format(elastic_net_opt_F.score(X_testF, y_testF)))
fig = go.Figure()
fig.add_trace(go.Scatter(x = X_testF.index, y = np.array(y_testF), name = "observed",
line_color = 'deepskyblue', opacity = 0.5))
fig.add_trace(go.Scatter(x = X_testF.index, y = rfF.predict(X_testF), name = "Random forest predictions",
line_color = 'lightslategrey', opacity = 0.5))
fig.add_trace(go.Scatter(x = X_testF.index, y = elastic_net_opt_F.predict(X_testF), name = "Elastic net predictions",
line_color = 'lightcoral', opacity = 0.5))
fig.update_layout(title_text = 'Observed test set vs Predicted energy using Random forest and elastic net reg on data with \
Fourier series',
xaxis_rangeslider_visible = True)
fig.show()
plot_predvstrue_reg(rfF.predict(X_testF), y_testF, model_name='Random forest with fourier terms test set')
# comparison of scores using elastic net and RF on the data with fourier terms
print('Random forest errors with fourier terms on training')
error_metrics(rfF.predict(X_trainF), y_trainF, model_name = 'Tuned Random forest with fourier terms', test=False)
print('\nRandom forest errors with fourier terms on test')
error_metrics(rfF.predict(X_testF), y_testF, model_name = 'Tuned Random forest with fourier terms', test=True)
print('\nElastic net errors with fourier terms on train')
error_metrics(elastic_net_opt_F.predict(X_trainF), y_trainF, model_name = 'Tuned Elastic net with fourier terms', test=False)
print('\nElastic net errors with fourier terms on test')
error_metrics(elastic_net_opt_F.predict(X_testF), y_testF, model_name = 'Tuned Elastic net with fourier terms', test=True)
# checking the feature importance for the random forest model
feat_imp = pd.DataFrame({'importance':rfFourier.feature_importances_})
feat_imp['feature'] = X_trainF.columns
feat_imp.sort_values(by='importance', ascending=False, inplace=True)
#feat_imp = feat_imp.iloc[:top_n]
feat_imp.sort_values(by='importance', inplace=True)
feat_imp = feat_imp.set_index('feature', drop=True)
_ = feat_imp.plot.barh(title = 'Random Forest feature importance', figsize = (12,7))
# # checking the feature importance for the Elastic net model
feat_imp = pd.DataFrame({'importance':np.abs(elastic_net_opt_F.coef_)})
feat_imp['feature'] = X_trainF.columns
feat_imp.sort_values(by='importance', ascending=False, inplace=True)
#feat_imp = feat_imp.iloc[:top_n]
feat_imp.sort_values(by='importance', inplace=True)
feat_imp = feat_imp.set_index('feature', drop=True)
_ = feat_imp.plot.barh(title = 'Elastic net feature importance', figsize = (12,7))
XGBoost (Extreme Gradient Boosting) belongs to the family of boosting algorithms and uses the gradient boosting (GBM) framework at its core. It is an optimized, distributed gradient boosting library.
XGBoost is well known for providing better solutions than many other machine learning algorithms. It is not often used for time series, especially with tree base learners, because it is difficult to capture a trend with trees. But since our data doesn't have a very significant trend, has multiple seasonalities, and depends significantly on an exogenous variable like temperature, we can try XGBoost to see how it performs on the energy consumption time series.
import xgboost as xgb
# Convert the dataset into an optimized data structure called DMatrix, which XGBoost supports and
# which gives it its performance and efficiency gains
data_dmatrix = xgb.DMatrix(data = sdgecyc.drop('SDGE', axis=1), label = sdgecyc['SDGE']) # not used
# generating the model
# 'reg:linear' was renamed to 'reg:squarederror' (same objective) in newer xgboost versions
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1,
                          max_depth = 3, alpha = 5, n_estimators = 100, random_state=42)
# Fitting the model on the training set
xg_reg.fit(X_trainF, y_trainF)
# Predicting on test set
preds_boost = xg_reg.predict(X_testF)
_ = error_metrics(preds_boost, y_testF, model_name='XGBoost with Fourier terms', test=True)
_ = error_metrics(xg_reg.predict(X_trainF), y_trainF, model_name='XGBoost with Fourier terms', test= False)
plot_ts_pred(y_trainF, xg_reg.predict(X_trainF), model_name='XGBoost with Fourier terms on Training set')
plot_ts_pred(y_testF, preds_boost, model_name='XGBoost with Fourier terms on Test set')
plot_predvstrue_reg(preds_boost,y_testF, model_name='XGBoost with Fourier terms on Test set')
# converting the predictions into series so that we can use the .resample method on it
preds_boost_series = pd.Series(preds_boost, index = y_testF.index )
# Resampling both the y_test and predictions at a 24 hours period and using the max as the aggregate function
_ = error_metrics(preds_boost_series.resample('24h').max(),
y_testF.resample('24h').max(),
model_name='XGBoost with Fourier terms; daily MAX', test=True)
# ...on training set max
preds_train = pd.Series(xg_reg.predict(X_trainF), index = y_trainF.index )
# Resampling both the y_test and predictions at a 24 hours period and using the max as the aggregate function
_ = error_metrics(preds_train.resample('24h').max(),
y_trainF.resample('24h').max(),
model_name='XGBoost with Fourier terms; daily MAX', test=False)
# For plotting the tree
import os
os.environ["PATH"] += os.pathsep + 'C:/Users/ppawar/graphviz-2.38/release/bin'
#pd.plotting.register_matplotlib_converters()
xgb.plot_tree(xg_reg,num_trees=98) # Choosing the last few trees because the model optimizes sequentially
plt.rcParams['figure.figsize'] = [30, 40]
#pd.plotting.register_matplotlib_converters()
xgb.plot_importance(xg_reg)
plt.rcParams['figure.figsize'] = [15, 7]
random_grid
# Tuning the XGBoost model
xgbtuned = xgb.XGBRegressor()
param_grid = {
'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
'learning_rate': [0.001, 0.01, 0.1, 0.2, 0.3],
'subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
'colsample_bytree': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
'colsample_bylevel': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
'min_child_weight': [0.5, 1.0, 3.0, 5.0, 7.0, 10.0],
'gamma': [0, 0.25, 0.5, 1.0],
'n_estimators': [10, 31, 52, 73, 94, 115, 136, 157, 178, 200]}
tscv = TimeSeriesSplit(n_splits=3)
xgbtunedreg = RandomizedSearchCV(xgbtuned, param_distributions=param_grid ,
scoring='neg_mean_squared_error', n_iter=20, n_jobs=-1,
cv=tscv, verbose=2, random_state=42)
xgbtunedreg.fit(X_trainF, y_trainF)
best_score = xgbtunedreg.best_score_
best_params = xgbtunedreg.best_params_
print("Best score: {}".format(best_score))
print("Best params: ")
for param_name in sorted(best_params.keys()):
    print('%s: %r' % (param_name, best_params[param_name]))
preds_boost_tuned = xgbtunedreg.predict(X_testF)
_ = error_metrics(preds_boost_tuned, y_testF, model_name='Tuned XGBoost with Fourier terms', test=True)
_ = error_metrics(xgbtunedreg.predict(X_trainF), y_trainF, model_name='Tuned XGBoost with Fourier terms', test=False)
plot_ts_pred(y_testF, preds_boost_tuned, model_name='Tuned XGBoost with Fourier terms on test')
plot_predvstrue_reg(preds_boost_tuned,y_testF, model_name='Tuned XGBoost with Fourier terms on test set')
As discussed earlier, classical time series modeling assumes the data is stationary. So the ideal approach would be to detrend the data, feed it into the ML models, and then add the trend back to the forecasted results. Nonetheless, good results were obtained above without detrending, because the energy consumption data from 2014 to 2018 has a very weak trend and the multiple seasonalities were handled well by the Fourier terms.
Also, the effect of cum_AC_kW, the cumulative PV installation to date, can be modeled using FB Prophet and then merged with XGBoost's forecast. A tree-based regression model like XGBoost cannot handle a variable like cum_AC_kW, because it is ever increasing and the test data will always contain larger values than the model saw in the training set.
I have extracted the trend and the cum_AC_kW impact on energy from the FB Prophet model and subtracted these two components from our main dataframe with all the Fourier terms. This detrended energy consumption data was then passed to the XGBoost model, and the XGBoost forecasts were added back to the total trend to get the final predictions.
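The extrapolation problem with trees is easy to demonstrate. Here is a minimal sketch with hypothetical data, using scikit-learn's DecisionTreeRegressor for brevity (the same limitation applies to XGBoost's tree base learners): a tree can only return averages of targets seen in its leaves, so its predictions flatten out beyond the training range.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Train on a steadily increasing target, like cumulative PV capacity over time
x_train = np.arange(100, dtype=float).reshape(-1, 1)
y_train = 2.0 * x_train.ravel()

tree = DecisionTreeRegressor(random_state=42).fit(x_train, y_train)

# Inputs beyond the training range all fall into the right-most leaf,
# so the tree keeps predicting the largest target it has ever seen
preds = tree.predict(np.array([[150.0], [200.0], [500.0]]))
assert np.allclose(preds, y_train.max())
```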
_ = plt.figure(figsize = (8,5))
_ = (forecast.trend + forecast.cum_AC_kW).plot()
fxdata = sdgecyc.copy()
fxdata.drop(['cum_AC_kW', 'year'], axis = 1, inplace=True)
# Detrending the data
fxdata['SDGE'] = fxdata['SDGE'].to_numpy() - (forecast.trend + forecast.cum_AC_kW).to_numpy()
fxdata['SDGE'].isna().sum()
pd.plotting.register_matplotlib_converters()
fxdata['SDGE'].plot()
data1 = fxdata
cols_to_transform = ['HourlyDryBulbTemperature'] # technically we don't need to standardize data for tree based models
X_trainFP, X_testFP, y_trainFP, y_testFP = train_test(data=data1, cols_to_transform=cols_to_transform,
scale = True, test_size = 0.15, include_test_scale=False)
# generating the model
xg_FP = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1,
                         max_depth = 5, alpha = 5, n_estimators = 73, random_state=42)
# Fitting the model on the training set
xg_FP.fit(X_trainFP, y_trainFP)
# Predicting on test set
preds_boostFP = xg_FP.predict(X_testFP)
# adding back the trend
trend_and_PV = forecast.trend + forecast.cum_AC_kW
boost_add_trend = preds_boostFP + trend_and_PV[-len(y_testF):]
plot_ts_pred(y_testF, boost_add_trend.to_numpy(), model_name='XGBoost with detrend, Fourier terms on test')
plot_predvstrue_reg(boost_add_trend.to_numpy(),y_testF, model_name='XGBoost with detrend Prophet, Fourier terms on test')
_ = error_metrics(boost_add_trend, y_testF, model_name='XGBoost with detrend Prophet, Fourier terms', test=True)
# on training set
preds_boostFP_train = xg_FP.predict(X_trainFP)
# adding back the trend
boost_add_trend_train = preds_boostFP_train + trend_and_PV[:(len(trend_and_PV) - len(y_testF))]
_ = error_metrics(boost_add_trend_train, y_trainF, model_name='XGBoost with detrend Prophet, Fourier terms', test=False)
Different models were tried to forecast the energy consumption in MWH of the San Diego Gas and Electric (SDGE) utility region. The energy consumption is highly dependent on the outside temperature and has strong multiple seasonalities - daily, weekly and yearly. Also, the energy consumption has decreased slightly from 2014 to 2018. The increasing PV (photovoltaic) installations in the region (cum_AC_kW) seems to have brought the decreasing trend in the energy consumption since more renewable energy at customer's facility means less load on the utility. But there can be other factors causing this decreasing trend such as the energy storage installations at the customer facilities, increase in electric efficiencies of the household and commerical equipment, people becoming more conscious of their usage (morally or through utility incenitves), etc.
The best way to capture the trend, which is a combination of all the above factors and possibly more, is to let the model learn it over a long period of time (at least 3 years). Seasonality is equally important for predicting the energy consumption of a region, so getting that part right was also crucial for improving model performance.
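As a side note, the Fourier-term seasonality features mentioned throughout can be generated with a small helper along these lines. This is a minimal sketch: the function name `fourier_terms` and the chosen harmonic orders are illustrative, not the notebook's exact code.

```python
import numpy as np
import pandas as pd

def fourier_terms(index, period, order, prefix):
    """Sine/cosine pairs for one seasonal period (period in hours)."""
    t = np.arange(len(index))
    out = {}
    for k in range(1, order + 1):
        out[f'{prefix}_sin{k}'] = np.sin(2 * np.pi * k * t / period)
        out[f'{prefix}_cos{k}'] = np.cos(2 * np.pi * k * t / period)
    return pd.DataFrame(out, index=index)

# Two days of hourly timestamps; daily (24 h) and weekly (168 h) terms
idx = pd.date_range('2018-01-01', periods=48, freq='h')
feats = pd.concat([fourier_terms(idx, 24, 3, 'daily'),
                   fourier_terms(idx, 24 * 7, 3, 'weekly')], axis=1)
print(feats.shape)  # (48, 12)
```

These columns can be concatenated with temperature, holiday and calendar features before training XGBoost; because sine/cosine pairs are smooth and extrapolate cleanly, they work for long forecast horizons where lag features do not.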
Different models were tried; here is a summary of the error metrics for each, including the baseline persistence model in which today's energy consumption is assumed equal to last year's consumption at the same hour:
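The persistence baseline itself is just a one-year shift of the load series. A minimal sketch on synthetic data (the series here is random and the column stands in for the actual SDGE load):

```python
import numpy as np
import pandas as pd

# Hypothetical hourly load series; the real notebook uses the 'SDGE' column in MWh.
rng = np.random.default_rng(0)
idx = pd.date_range('2014-01-01', '2018-12-31 23:00', freq='h')
y = pd.Series(2000 + 100 * rng.standard_normal(len(idx)), index=idx)

# Persistence baseline: prediction = consumption at the same hour 365 days earlier.
baseline = y.shift(365 * 24)
mask = baseline.notna()  # first year has no prior-year value
mape = np.mean(np.abs((y[mask] - baseline[mask]) / y[mask])) * 100
print(f'baseline MAPE: {mape:.2f}%')
```

Any candidate model only earns its complexity if it beats this shifted series on the same test window.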
trydf = pd.DataFrame.from_dict(dict_error)
#trydf.sort_values('MAPE', ascending=True).groupby(['model', 'train_test']).mean()
sorted_errors = trydf.pivot_table(index = 'model', columns = 'train_test',
aggfunc='min').sort_values(('MAPE', 'test'), ascending=True)
table = (sorted_errors.sort_index(axis=1, level=1, ascending=False).sort_index(axis=1, level=[0], sort_remaining=False)).\
round(3)
table.style.highlight_min(['MAPE', 'MAE', 'RMSE'],
axis=0).highlight_max(['R2'], axis=0).highlight_null(null_color='gray')
_ = plt.figure(figsize=(8,6))
_ = sorted_errors['MAPE'].plot()
_ = plt.xticks(ticks = range(0,len(sorted_errors)), labels = sorted_errors.index.values, rotation=90)
Note: NaN values are for the models which couldn't be tested on a training set.
Based on the MAPE and RMSE scores, the XGBoost model with Fourier terms performed very well over a forecasting window of 8 months ahead. For hourly data with multiple seasonalities, that is an impressive result.
For long-term forecasts, most of the models performed better than the baseline persistence model, and the best model (XGBoost) gives a MAPE of 5.08% compared to the baseline error of 9.23% on the test set. Its RMSE and MAE are also considerably lower than the baseline's, and its R2 higher. The difference in RMSE alone is almost 160 MW, which is significant.
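For reference, the four metrics quoted above can be computed as follows. This is a sketch of what the `error_metrics` helper presumably returns; the notebook's actual implementation may differ in details.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def error_metrics(y_pred, y_true):
    """MAPE (%), RMSE, MAE and R2 for a forecast against observed values."""
    y_pred, y_true = np.asarray(y_pred, float), np.asarray(y_true, float)
    return {
        'MAPE': np.mean(np.abs((y_true - y_pred) / y_true)) * 100,
        'RMSE': np.sqrt(mean_squared_error(y_true, y_pred)),
        'MAE': mean_absolute_error(y_true, y_pred),
        'R2': r2_score(y_true, y_pred),
    }

m = error_metrics([2020, 2450], [2000, 2500])
print(m['MAPE'])  # 1.5
```

MAPE is the headline metric here because it is scale-free, but RMSE matters too: for a utility, the same percentage error costs more at peak load than off-peak.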
FB Prophet does a very good job of identifying the trend and seasonalities of the data, and it can be paired with XGBoost to obtain a more robust long-term forecast. The trend and the cum_AC_kW regression coefficient were extracted from the trained FB Prophet model and used to detrend the training set for the XGBoost model; the XGBoost predictions on the detrended series were then recombined with the Prophet trend to give an aggregate forecast. Unfortunately, this didn't work very well, as the forecast at the end of the time series drops at a faster rate than the observed values.
While comparing the models, I am not considering the elastic net and random forest regressions with all lags, because they are only suitable for hour-ahead forecasts. Even SARIMA does comparably well for short-term hour-ahead or week-ahead forecasts, so pretty much any of the models above that use lag variables should be good for short-term forecasting (95-98% R2 and 1-3% MAPE). Elastic net is the best choice for short-term forecasts, given that it had the highest accuracy and its training time is much shorter than SARIMAX's.
SARIMA performs poorly on long-term forecasts and doesn't capture the multiple seasonalities well. More feature engineering might help SARIMA identify the multiple seasonalities, but given how time-consuming SARIMA training is, it is better to focus resources on the other models.